Improving the reliability of cohesion policy databases

In this contribution, we present an innovative data-driven model to reconstruct a reliable temporal pattern for time-lagged statistical monetary figures. Our research cuts across several domains regarding the production of robust economic inferences and the bridging of top-down aggregated information from central databases with disaggregated information obtained from local sources or national statistical offices. Our test bed case study is the European Regional Development Fund (ERDF). The application we discuss deals with the reported time lag between the local expenditures of ERDF by beneficiaries in Italian regions and the corresponding payments reported in the European Commission database. Our model reconstructs the timing of these local expenditures by back-dating the observed European Commission reimbursements. The inferred estimates are then validated against the expenditures reported from the Italian National Managing Authorities (NMAs) in terms of cumulative monetary difference. The lower cumulative yearly distance of our modelled expenditures compared to the official European Commission payments confirms the robustness of our model. Using sensitivity analysis, we also analyse the relative importance of the modelling parameters on the cumulative distance between the modelled and reported expenditures. The parameters with the greatest influence on the uncertainty of this distance are the following: first, how the non-clearly regionalised expenditures are attributed to individual regions; and second, the number of backward years that the residuals of the yearly payments are spread onto. In general, the distance between the modelled and reported expenditures can be further reduced by fixing these parameters. However, the gain is only marginal for some regions. The present study paves the way for modelling exercises that are aimed at more reliable estimates of the expenditures on the ground by the ultimate beneficiaries of European funds. 
Additionally, the output databases can contribute to enhancing the reliability of econometric studies on the effectiveness of European Union (EU) funds.


Introduction
In this contribution we propose a data-driven model to estimate the actual economic time series from those reported in official centralised databases and benchmark them against bottom-up evidence coming from local statistical offices. The test bed case study is the ERDF, which aims to strengthen economic, social, and territorial cohesion in the EU by correcting interregional imbalances. ERDF is part of the European Structural and Investment Funds (ESIF), which represent roughly two-thirds of the whole European budget, amounting to more than 1 trillion euros for the period 2014-2020 (approximately 1% of the EU-28 gross national income). For the same period, the ERDF allocated more than 220 billion euros in investments to diverse areas.
Additionally, a new temporary instrument, the NextGenerationEU, amounting to 750 billion euros has been introduced for the period 2021-2027 to ease the recovery of EU countries from the COVID-19 pandemic. The extraordinary circumstances that have arisen due to this challenge further emphasise the importance of understanding the spending pattern of public resources [1], particularly so as to inform the public about the allocations of EU tax-payers' money and their benefits.
Every Member State (MS) in the EU is obliged to record and account for the expenses incurred, but the complete records are neither (always) publicly available nor directly comparable across individual MSs. For this reason, a spatially and temporally homogeneous expenditure database at the European level will always be lacking. This is precisely why resorting to modelling these expenditures can help to alleviate the issue.
The European Commission (EC)-managed database suffers from an inevitable time lag because EC payments are reimbursed only after the incurred expenses have been invoiced. Precisely, the final beneficiaries invest on the ground and produce invoices to the NMAs, which, in turn, certify the share of eligible expenditures and produce an invoice to the EC. Even if this process runs smoothly (i.e., the relevant documents are promptly processed at each stage, no accounting mistakes are made, payments are not suspended following audit inspections, and so on), a substantial time lag may occur from the incurred expenditure on the ground to the moment when the EC payment to the NMAs is actually recorded. Therefore, it is very challenging to produce econometric inferences regarding the benefits of EU cohesion policy funding on the ground.
The model we discuss in this contribution estimates the expenditures incurred by European regions from their observed EC reimbursement pattern. To the best of our knowledge, this is the first time this approach has been attempted. Our modelled expenditures are validated against the expenditures reported by the Italian authorities, which have recently become available.
Econometricians can make use of the output database of modelled expenditures to generate more robust inferences regarding the benefits of EU cohesion policy in a number of fields, including convergence analysis [2][3][4][5][6] and in terms of the effects on GDP [7,8], employment [7], or local administrative and governance capacity [9][10][11] among many others.
The next section presents the data and methods used for the present study. In addition, we demonstrate how a Wasserstein measure [12], which gives the distance between two curves, is a convenient proxy to drive inferences on the observed reimbursement patterns and to model the target pattern of expenditures.

Materials and methods
Data for ESIF EC payments to MSs and regional authorities were obtained from the EC Directorate-General for Regional and Urban Policy (DG REGIO). In every multi-year financial programming period of the EU, each individual MS systematically records the requests for payments (and related invoices) made by individual beneficiaries. However, these documents are neither standardised nor necessarily accessible, and they do not contain the same information or follow the same structure across regions or MSs. Furthermore, the number of primary sources amounts to more than 300 because the individual NMAs store their own records. Attempts to build these data bottom up have therefore been unsuccessful, owing to unreliable and incomplete information across MSs. The current case study is grounded on the "official" database at the EC level. This dataset consists of around 500,000 entries, which include the yearly payments to the Operational Programmes (OPs) over four overarching funding schemes (Cohesion Fund (CF), European Agricultural Fund for Rural Development (EAFRD), ERDF, European Social Fund (ESF)) and five programming periods (1989-1993, 1994-1999, 2000-2006, 2007-2013, 2014-2020). OPs are the reference unit and have variable geographical scope: regional, multi-regional, national, or across multiple MSs. The EC does not directly collect information disaggregated at the regional level.

PLOS ONE
The pre-processing stage of this study involved regionalising EC payments over the 280 regions at level 2 of the Nomenclature of Territorial Units for Statistics (NUTS), known as NUTS2. Over time, the number and borders of NUTS2 regions have changed to reflect countries joining or leaving the EU, as well as administrative readjustments within individual MSs. For this reason, during pre-processing, we harmonised the nomenclature to the NUTS2 2010 version, where each code maps uniquely onto a specific region. The detailed regionalisation procedure is described in Wishlade et al. [13] (work-package 13) and Lo Piano et al. [14]. Every entry in this regionalised EC dataset of payments corresponds to an amount reimbursed to a given regional authority over a specific year, funding scheme, and programming period. For the funding scheme ERDF, non-regionalised figures were attributed on a per-capita basis to individual NUTS2 areas for those countries that had not broken down the information at the NUTS2 level. If the region's level of development was specified, a per-capita attribution was performed among the NUTS2 areas with the same level of development.
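The per-capita attribution described above can be illustrated with a minimal sketch. The function name `attribute_per_capita`, the region codes, and the population figures are hypothetical and serve only as an illustration; the regionalisation actually applied is the one documented in [13] and [14].

```python
def attribute_per_capita(amount, populations):
    """Split a non-regionalised amount across regions in proportion to population.

    `populations` maps a (hypothetical) NUTS2 code to its population.
    """
    total = sum(populations.values())
    if total <= 0:
        raise ValueError("total population must be positive")
    return {code: amount * pop / total for code, pop in populations.items()}


# Illustrative figures only (not taken from the EC dataset).
shares = attribute_per_capita(90.0, {"R1": 1_000_000, "R2": 2_000_000})
```

When the level of development is specified, the same function could be applied separately to each group of regions sharing that level.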
In this contribution, we cover the funding scheme ERDF reimbursed over the programming period 2007-2013. We tested the modelled expenditures against the actual incurred expenditures for a pioneer MS, Italy (IT), whose data were provided by the Italian managing authority for this programming period and funding scheme. These data are available from https://opencoesione.gov.it/en/ and also included uncertified expenditures (i.e., figures that have not yet been accounted for in the consolidated EC payments database).
Sensitivity analysis [15][16][17] was used to explore how uncertainty in the input variables affected the output variable. Our analysis was performed within a factor-prioritisation [18] and direction-of-change setting; that is, we aimed to identify the key drivers that influenced the output variable and the impact associated with fixing these drivers, respectively. One of the most widely adopted approaches to sensitivity analysis is the variance-based approach, wherein output uncertainty is measured by the variance of the output, which is then apportioned to the input parameters. Another class of sensitivity methods, the moment-independent sensitivity measures, does not rely on a particular statistical moment. As we will see in the next section, our output Quantity of Interest (QoI) is the distance between the cumulative distributions of the expenditures, which naturally supports the selection of a moment-independent sensitivity measure. Several moment-independent measures have been proposed in the literature (see Borgonovo [19], Plischke, Borgonovo & Smith [20], and Pianosi & Wagener [21]). We hereby adopt the δ moment-independent sensitivity measure [19,20] to evaluate how the modelling parameters influence uncertainty regarding the distance between the modelled and the reported expenditures.
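For reference, the δ measure of Borgonovo [19] is commonly written as below. The symbols $Y$ (model output), $X_i$ (the $i$-th uncertain input), and $f$ (probability density) are notation introduced here for illustration and do not appear elsewhere in this text.

```latex
\delta_i \;=\; \frac{1}{2}\,
\mathbb{E}_{X_i}\!\left[ \int \left| f_Y(y) - f_{Y \mid X_i}(y) \right| \,\mathrm{d}y \right],
\qquad 0 \le \delta_i \le 1 .
```

A value of zero indicates that the output distribution does not change when $X_i$ is fixed, that is, $Y$ is independent of $X_i$.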

Use of distance measures to estimate expenditure patterns
Consider a generic region p whose expenditures can be reimbursed over k eligible years. Let us also introduce a dummy region against which we can benchmark the reported reimbursement patterns. The dummy region has a constant spending and reimbursement pattern; specifically, it spends a certain amount and is reimbursed for the same amount each year until the last eligible year of the programming period. For instance, if reimbursements can occur over 10 years, the dummy region is reimbursed 10% each year of the total budget it has been granted.
For the sake of comparability, reimbursements are normalised across regions. The normalised cumulative expenses of the dummy region over k years are given by Eq 1. Analogously, the regional reimbursement pattern for a generic region p is defined in Eq 2. One may also define an equivalent of the Wasserstein metric in probability theory as the distance $\mu_p$ between the cumulatives of these two curves [12], as per Eq 3.
To assess the time specificity of the reimbursement trend rather than the simple divergence from the regular dummy pattern, it is possible to address the plain difference in line with the Kruglov distance [12]. This is denoted by $m^s_p$ and is given by Eq 4.
These measures are complementary: $\mu_p$ acknowledges the divergence from a constant spending pattern, although it does not capture its time specificity (i.e., early versus late reimbursement pattern); conversely, $m^s_p$ addresses this specificity, although it suffers from compensatory effects across years (e.g., a positive difference in the year i would be cancelled out by an equal negative difference in the year i + 1). This makes it impossible to evaluate the precise reimbursement pattern at a yearly granularity through $m^s_p$.
Let us now analyse the extreme case in which all the EC payments are reimbursed in the last eligible year of the programming period. In this case, $\mu_p$ attains its maximum value (Eq 5). Analogously, the assessment can be repeated for a hypothetical region whose expenditures are entirely reimbursed in the first year (Eq 6).
The figures shown in Eqs 5 and 6 define the maximum threshold for our measurement. The minimum value zero is attained for a hypothetical region whose reimbursement pattern is identical to the dummy region.
One can easily obtain the values for the $m^s_p$ measurement, which is equivalent to $\mu_p$ in the case of an early reimbursement pattern. The sign is the opposite in the case of extremely late reimbursement and amounts to $-\frac{k-1}{2}$. Overall, regions anticipating the dummy spending pattern have a positive sign, whereas the sign is negative for regions with a delayed reimbursement pattern.
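The two distance measures and their bounds can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' exact notation: $D_i$ denotes the normalised cumulative reimbursement of the dummy region up to year $i$, and $R^p_i$ the corresponding quantity for region $p$ (both symbols are introduced here), with the sign chosen so that an early pattern gives a positive value.

```latex
\begin{align*}
D_i &= \frac{i}{k}, \qquad i = 1, \dots, k
  && \text{(dummy region, cumulative)} \\
\mu_p &= \sum_{i=1}^{k} \left| R^p_i - D_i \right|
  && \text{(distance between cumulatives)} \\
m^s_p &= \sum_{i=1}^{k} \left( R^p_i - D_i \right)
  && \text{(signed difference)} \\
\mu_p^{\text{late}} &= \sum_{i=1}^{k-1} \frac{i}{k} = \frac{k-1}{2},
  \quad m^{s,\text{late}}_p = -\frac{k-1}{2}
  && (R^p_i = 0 \text{ for } i < k,\; R^p_k = 1) \\
\mu_p^{\text{early}} &= \sum_{i=1}^{k} \left( 1 - \frac{i}{k} \right) = \frac{k-1}{2},
  \quad m^{s,\text{early}}_p = +\frac{k-1}{2}
  && (R^p_i = 1 \text{ for all } i)
\end{align*}
```

For k = 10, for example, both bounds equal 4.5 in absolute value.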
The two measures are complementary: $\mu_p$ measures the regularity of the reimbursement pattern, which is a property that policymakers will benefit from considering when managing the spending of granted financial resources. $m^s_p$ points to the specificity of the regional reimbursement pattern over the course of the entire programming period (delayed vs. early).
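A minimal sketch of how the two measures could be computed from a yearly reimbursement series is given below. The function name `distance_measures` and the example series are hypothetical, and the sign convention (positive for early patterns) is assumed from the description above.

```python
from itertools import accumulate


def distance_measures(yearly_reimbursements):
    """Return (mu, m_s) for one region against the constant-pattern dummy region."""
    k = len(yearly_reimbursements)
    total = sum(yearly_reimbursements)
    if k == 0 or total <= 0:
        raise ValueError("need a non-empty series with a positive total")
    # Normalised cumulative reimbursement of the region, year by year.
    region_cum = [c / total for c in accumulate(yearly_reimbursements)]
    # The dummy region is reimbursed 1/k of its budget every year.
    dummy_cum = [(i + 1) / k for i in range(k)]
    diffs = [r - d for r, d in zip(region_cum, dummy_cum)]
    mu = sum(abs(x) for x in diffs)  # regularity: distance from the constant pattern
    m_s = sum(diffs)                 # specificity: > 0 early, < 0 late
    return mu, m_s


late = distance_measures([0] * 9 + [100])    # everything reimbursed in year 10
early = distance_measures([100] + [0] * 9)   # everything reimbursed in year 1
flat = distance_measures([10] * 10)          # identical to the dummy region
```

The two extreme series share the same regularity value but have opposite signs for the specificity value, which is why the measures are read together.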
We use $m^s_p$ to define an index of regional specificity $Ir$ that, in turn, enables us to rank regions and propose regional spending patterns, as detailed in the following section.

Taxonomy of cases
Let us consider the cumulative annual history of expenditures invoiced from the ultimate beneficiaries to the NMAs, $MS^p$. $E^p$ is our modelled cumulative expenditure, namely the quantity with which we attempt to reproduce the pattern of $MS^p$. As per Eq 7, the sum of the yearly figures over the entire programming period must be equal to assure consistency.
EC payments are spread over k years, while expenditures are not eligible after the (k − m)th year of the programming period.
A general rule is that reimbursements always follow expenses, which can be used as a basis for modelling the yearly expenditure patterns. A situation wherein cumulative modelled expenditures are smaller than the expenditures of NMAs would be wrong and would require the correction of our model.
For instance, let us take the first year of the programming period. In this year, the relation between the yearly figures must be:
Furthermore, $ms^p_1$ anticipates $r^p_1$, and the relation between these two quantities must also hold.
Finally, significant differences between $ms^p_1$ and $e^p_1$ would not be plausible. $ms^p_1$ also accounts for invoices sent by local authorities (e.g., municipalities). This condition ensures that the time lag with $e^p_1$ is minimal and, in particular, below the yearly granularity at which these figures have been produced. Therefore, the relation between the two figures should be:
We can also extend these relations to the l-th year, with l ≤ k − m, as per the following:
Therefore, our workflow will first focus on evaluating the closure relation as per Eq 8, although uncertified expenditures may result in discrepancies of variable magnitude.
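The general rule that reimbursements always follow expenses can be checked on cumulative series with a short sketch. The function name `years_violating_order` and the yearly figures are hypothetical; the check covers only this one ordering rule, not the full set of relations discussed above.

```python
from itertools import accumulate


def years_violating_order(yearly_expenditures, yearly_reimbursements, tol=1e-9):
    """Return the 1-based years in which cumulative reimbursements exceed
    cumulative expenditures, i.e. where 'reimbursements follow expenses' fails."""
    cum_e = accumulate(yearly_expenditures)
    cum_r = accumulate(yearly_reimbursements)
    return [year for year, (e, r) in enumerate(zip(cum_e, cum_r), start=1)
            if r > e + tol]


# Hypothetical yearly figures (same total, different timing).
ok = years_violating_order([40, 30, 20, 10], [10, 20, 30, 40])
bad = years_violating_order([10, 20, 30, 40], [40, 30, 20, 10])
```

An empty list means the modelled series is compatible with the rule; any listed year would call for a correction of the model.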

Modelling the incurred expenditures
In this contribution, we use an adaptation of the model to estimate local expenditures from the reported EC payments [14,22,23] and validate the ERDF figures for Italy against those of the national managing authority. The model was developed based on joint reflections among practitioners in technical fields such as modelling and data analysis, as well as practitioners involved in the operation of EU regional policy programmes. The rationale of the model is to project the reimbursed payments backwards to capture the actual temporality of the reimbursed financial resources, along with their effects on the local receiving areas.
An overarching assumption of the model is that each yearly expenditure corresponds only to a fraction of the payment reimbursed in the same year. The complementary fraction of this payment is attributed to expenditures incurred over the previous year(s).
Yearly expenditures are estimated by ranking each EU region against a dummy region, as discussed in the previous subsections. In turn, a coefficient of regional specificity $Ir_p$ is calculated from the ranks of $m^s_p$ over each individual funding scheme and programming period. This coefficient is used to define the spending pattern of regions. The higher its value, the greater the delay that characterises the region's reimbursement pattern compared to the dummy region's constant reimbursement pattern.
Feature scaling, also known as min-max normalisation, was performed on the $Ir_p$ series. This leads to a value of 0 for the region with the earliest reimbursement pattern and 1 for the latest.
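The ranking and scaling step can be sketched as follows. This is an illustration under assumptions: regions are ranked by their signed distance so that the earliest pattern (largest value) receives 0 and the latest receives 1, ties are not treated specially, and the function name `regional_specificity` and the input values are hypothetical.

```python
def regional_specificity(m_s_by_region):
    """Rank regions by m_s and min-max scale the ranks to [0, 1].

    Assumption: the region with the largest m_s (earliest pattern) gets 0 and
    the region with the smallest m_s (latest pattern) gets 1. Ties are broken
    arbitrarily by sort order in this sketch.
    """
    ordered = sorted(m_s_by_region, key=m_s_by_region.get, reverse=True)
    ranks = {region: rank for rank, region in enumerate(ordered)}
    lo, hi = min(ranks.values()), max(ranks.values())
    if hi == lo:
        return {region: 0.0 for region in ranks}
    return {region: (rank - lo) / (hi - lo) for region, rank in ranks.items()}


# Hypothetical m_s values for three regions.
ir = regional_specificity({"A": 1.2, "B": -0.4, "C": -2.0})
```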
For instance, consider $Ir_p = 0.6$ for the ERDF funding scheme over the 2007-2013 programming period. This implies that the payment reported in the last eligible year of expenditures is attributed to expenditures that were also incurred over a maximum of $\mathrm{int}((2017 - 2007) \times Ir_p^{2007-2013,\,ERDF}) = 6$ previous years. The uncertain parameters in the model, as shown in Table 1, are:
• Maximum share of payment attributed to an expenditure incurred in the same year, $\phi_{max}$
• Minimum share of payment attributed to an expenditure incurred in the same year, $\phi_{min}$
• Number of Years of expenditures that the residual payment can be attributed to backwards, which is defined as $\mathrm{int}((k - 1) \times Ir_p^{pp,fs})$ for a generic region p over the programming period pp and the funding scheme fs
The other uncertain parameter is a binary trigger related to non-attributed NMAs expenditures. These are expenditures that do not map onto specific NUTS2 areas. This regional reattribution is proportional either to the funds reimbursed in the year of the unattributed payment or to the funds reimbursed over the whole programming period. Table 1 specifies the uncertainty range of these input parameters.
The range of the distributions of the input parameters was selected after consultation with practitioners directly involved in EU regional policy programmes. For continuous variables, uniform distributions were conservatively adopted due to the absence of information concerning the individual probability of values across the range.
The share of the payment attributed to an expenditure occurring in the same year is denoted by $\phi_p$, which is expressed as follows:
The quantity $\phi_p$ is constant across years. Also, the resulting expenditure incurred over year i is equal to:
The greater the value of $Ir_p^{pp,fs}$, the lower the share of the payments attributed to expenditures incurred in that specific year. The rationale for this hypothesis is that a later reimbursement pattern most likely results from conspicuous time lags between the incurred expenditures and the reimbursed payments.
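One simple functional form consistent with this description is a linear interpolation between the maximum and minimum shares. This form is an assumption made here for illustration and is not confirmed by the text; the function name `same_year_share` and the numerical bounds are likewise hypothetical.

```python
def same_year_share(ir, phi_min, phi_max):
    """Share of a payment attributed to same-year expenditure.

    Assumption (not confirmed by the text): linear interpolation, so that
    ir = 0 gives phi_max and ir = 1 gives phi_min.
    """
    if not 0.0 <= ir <= 1.0:
        raise ValueError("ir must lie in [0, 1]")
    if phi_min > phi_max:
        raise ValueError("phi_min must not exceed phi_max")
    return phi_max - (phi_max - phi_min) * ir


# Hypothetical bounds; Table 1 gives the ranges actually explored.
phi = same_year_share(ir=0.6, phi_min=0.2, phi_max=0.7)
```

Whatever the exact form, the property that matters is monotonicity: a larger coefficient of regional specificity yields a smaller same-year share.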
The residual of the payment (i.e., the fraction of the payment not attributed to expenditures incurred in the same year) is spread onto the previous years as per the third uncertain parameter, Years. To demonstrate, consider the task of attributing the residual of the 2017 payment over the three previous years. In this case, a fraction of the payment reimbursed in the year 2017 is attributed to expenditures incurred in the years 2016, 2015, and 2014 under the assumption that these are halving each preceding year: specifically, 4/7 of the residual of the 2017 payment is attributed to expenditures incurred in 2016, 2/7 in 2015, and 1/7 in 2014. The rationale behind this assumption is that the magnitude of a payment is most likely more strongly correlated with more recent expenditures.
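The halving rule generalises to any number of backward years, as the sketch below shows with exact fractions. The function name `halving_weights` is hypothetical.

```python
from fractions import Fraction


def halving_weights(n_years):
    """Weights for spreading a residual over `n_years` previous years.

    The first weight applies to the most recent previous year; each earlier
    year receives half the weight of the year after it, and the weights sum to 1.
    """
    if n_years < 1:
        raise ValueError("n_years must be at least 1")
    total = 2 ** n_years - 1
    return [Fraction(2 ** (n_years - 1 - j), total) for j in range(n_years)]


weights = halving_weights(3)  # shares for 2016, 2015, and 2014 in the example
```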
The assumption is also made that the total number of backward years of expenditures that the residual of the payment is attributed to is correlated over the programming period; in particular, we assume that it decreases by one year each preceding year. In this way, if the residual of the 2017 payment is attributed to expenditures incurred over the three previous years, this quantity would only amount to two for the residual of the payment reimbursed in the year 2016, and one for 2015. The minimum number of backward years is one, and this quantity is kept constant for all the years backwards up to the second year of the programming period once this threshold is met. Payments reimbursed in the first year of the programming period are entirely attributed to expenditures incurred in the same year.
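The back-dating rules described so far can be combined in a minimal sketch for a single region. It is a simplification under stated assumptions, not the authors' implementation: the same-year share is passed in as a constant, the payment series is taken to end in the last eligible year, and the function name `backdate_payments` and the figures are hypothetical.

```python
def backdate_payments(payments, phi, years_last):
    """Turn yearly payments into modelled yearly expenditures.

    payments   -- yearly payments, first to last year of the period
    phi        -- share of each payment kept in the year it was reimbursed
    years_last -- backward years used for the residual of the last payment
    """
    n = len(payments)
    expenditures = [0.0] * n
    for t, payment in enumerate(payments):
        if t == 0:
            # First-year payments are attributed entirely to the first year.
            expenditures[0] += payment
            continue
        expenditures[t] += phi * payment
        residual = (1.0 - phi) * payment
        # One backward year fewer for each step back from the last year,
        # never fewer than one and never more than the years available.
        back = min(t, max(1, years_last - (n - 1 - t)))
        total_weight = 2 ** back - 1
        for j in range(1, back + 1):
            # Halving weights: the nearest previous year gets the largest share.
            expenditures[t - j] += residual * 2 ** (back - j) / total_weight
    return expenditures


# Hypothetical payments for a four-year period.
modelled = backdate_payments([0.0, 0.0, 0.0, 70.0], phi=0.5, years_last=3)
```

Because every payment is fully redistributed, the total of the modelled expenditures equals the total of the payments, in line with the consistency requirement on the sums over the programming period.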
In total, $2^{17}$ (≈130,000) Monte Carlo simulations were performed by sampling the uncertainty parameters from these distributions through quasi-random LP$_\tau$ Sobol' low-discrepancy sequences [24]. This sample size was selected to ensure the convergence of the sensitivity indices (see the convergence plots in S1-S20 Figs in S1 File). The rationale for using Monte Carlo simulations was to generate a population of expenditure distributions and to evaluate their reliability against figures reported by the NMAs. The Python scripts are available as Jupyter Notebooks on GitHub.
The Python scripts thoroughly describe the preparation and curation of the dataset for the comparison, as well as the uncertainty analysis performed. The output variable is the cumulative distance $m^s_{IT}$ between the reported IT expenditure and the modelled expenditure previously introduced. After consulting with practitioners involved in EU regional policy programmes, the assumption was made that the last year of eligible expenditures was 2017.
To understand how the uncertainty of the input parameters reflected onto the output uncertainty, moment-independent sensitivity analysis was performed using a Matlab Betaks3 subroutine. In moment-independent sensitivity analysis, the sensitivity measure is the distance between the unconditional distribution of the output and its distribution conditional on an uncertain parameter. The δ sensitivity measure [19,20] is calculated for the four input parameters shown in Table 1. The logic behind δ is the following: one factor is fixed to a value, and the difference between the curve obtained by fixing this factor and the standard output curve is measured. If the factor is important, fixing it will tangibly affect the output in terms of curve shape. The experiment is repeated by fixing the factor at different values over its range of variability until an average difference is obtained.
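The logic of δ can be illustrated with a crude histogram-based estimate computed from an existing sample. This sketch is not the Betaks3 routine used in the study: it replaces "fixing a factor" with grouping the sample into classes along that factor, it is biased upwards for small samples, and the function name `delta_estimate`, the toy model, and the class and bin counts are all hypothetical.

```python
import random


def delta_estimate(x, y, n_classes=10, n_bins=20):
    """Crude given-data estimate of the delta sensitivity measure of y to x."""
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0

    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(values) for c in counts]

    unconditional = histogram(y)
    # Partition the sample into classes of (nearly) equal size along x.
    order = sorted(range(len(x)), key=x.__getitem__)
    size = len(x) // n_classes
    delta = 0.0
    for c in range(n_classes):
        idx = order[c * size:(c + 1) * size] if c < n_classes - 1 else order[c * size:]
        conditional = histogram([y[i] for i in idx])
        # Half the L1 distance between the two histograms, weighted by class size.
        l1 = sum(abs(u - v) for u, v in zip(unconditional, conditional))
        delta += 0.5 * l1 * len(idx) / len(x)
    return delta


rng = random.Random(0)
x1 = [rng.random() for _ in range(5000)]
x2 = [rng.random() for _ in range(5000)]
y = [a + 0.05 * b for a, b in zip(x1, x2)]  # toy output driven mostly by x1
delta_x1, delta_x2 = delta_estimate(x1, y), delta_estimate(x2, y)
```

In this toy model the first input should receive the larger estimate, since conditioning on it changes the shape of the output histogram far more than conditioning on the second.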

Results and discussion
The distance against the reported MS expenditures for the modelled expenditures and payments for Italy (IT) is illustrated in Fig 1. Mismatches between the certified and uncertified data amount to approximately 10% of the former.
In Fig 1, the boxplots are always below the threshold range defined by the distance with the reported EC payments (black rectangles), with no or a (mostly) minor degree of overlap.
By examining the shapes of the distributions, it is possible to identify multi-peak and/or multi-modal distributions for certain regions. The most emblematic cases are Molise (Mls), Campania (Cmp), Friuli Venezia Giulia (FVG), and Lazio (Laz), as shown in Fig 2. By contrast, Lombardy (Lmb), Apulia (Apl), Emilia-Romagna (EmR), and Tuscany (Tsc) show the distributions closest to normal, but they still exhibit some degree of skewness (Fig 2).
The remaining cases are provided in S21 Fig in S1 File. By examining the regional $Ir_p$, it is not possible to identify a precise correlation between this value and the distribution's width, skewness, or level of multi-modality.
Sensitivity analysis enables the investigator to apportion the output uncertainty onto the input parameters. The use of moment-independent sensitivity analysis is a fortiori justified by the skewed and multi-modal distributions obtained, which imply that variance is a potentially poor measure of output uncertainty in these settings. The values of the sensitivity measure δ are reported in Fig 3. The higher the value, the more influential the uncertainty of the input parameter on the output uncertainty.
The variable Residual selector was identified as the most influential parameter for 10 of the 20 Italian regions, Years for 8, and $\phi_{max}$ for the remainder. Figs 4 and 5 show an example for each of the first two cases, respectively. The charts for the other regions can be found in S22-S39 Figs in S1 File. In Fig 4, one can appreciate how the different values of the trigger Residual selector 'activate' each part of the bi-modal-shaped distribution. The trend is similar in Fig 5, where the distinct sub-components of the output distribution are activated by different values of Years. In the sub-figure for y given $\phi_{min}$, one can also appreciate the importance of this parameter, given the wide range of output shapes explored when this continuous parameter is varied over its range.
This information can be used to inform a factor-prioritisation strategy [15] focused on reducing the QoI (i.e., the cumulative distance with the reported MS expenditures) by fixing the most influential parameters to the value that would minimise this quantity. At this point, let us assume that it is possible to choose either one option or the other for the trigger Residual selector. On doing so, the option of attributing the residual according to the reported regional yearly expenditures will lead to a lower distance in the case of FVG (Fig 6). This trend is shared by 16 of the 20 Italian regions. The trajectory is similar if the other parameter with the greatest influence, Years, is fixed towards its higher end, as indicated in the example of Calabria (Clb) in Fig 6. This trend is shared by 18 of the 20 Italian regions.
When fixing both these parameters at their optimal values, one obtains a reduction for 18 of the 20 regions. Notably, however, Laz is one of the two regions showing an opposite trend.
These findings are encouraging, yet the sample of regions investigated is too small to conclude that fixing these factors is an effective strategy to reduce the output variability and simplify the model developed: figures for more MSs would be needed to draw robust inferences that corroborate potential adjustments.
Existing interactions among the parameters justify the choice of global sensitivity analysis over a plain one-variable-at-a-time approach, which would overlook them. Using variance-based sensitivity analysis [15], where the sensitivity metric is the variance of the output cumulative distance, more than 10% of the output variance would remain unattributed for two regions: Liguria (Lgr) and Cmp. When interactions among factors are neglected, only 80% of the output variance is apportioned for the latter and only 55% for the former, for which interactions among factors are responsible for 45% of the output variance.

Conclusions
This study presented a model to infer incurred expenditures for European regions from the reported reimbursement pattern of EC funds. The data-driven model elaborated more reliable local time-resolved figures based on the patterns reported in official centralised figures. In our study, we also validated the output time-resolved database through benchmarking against the local official statistics. We showcased an application of our model for the ERDF by considering the example of Italy for the programming period 2007-2013.
The work presented here aligns closely with several of the core principles of the European Statistics Code of Practice [25], where the common quality framework of the European Statistical System for the National Statistical Authorities and Eurostat is defined. In particular, it is consistent with Principle 4, Commitment to Quality, because it develops procedures to monitor and improve the quality of the statistical data processes, also favouring the integration of data from multiple sources. It is also aligned with Principle 12, Accuracy and Reliability, through its assessment and validation of source data, integrated data, intermediate results, and statistical outputs.
The model implemented in this study can move from EC payments to simulated expenditures in a way that matches the actual expenditures on the ground as closely as possible. Testing involved comparing the cumulative trends of the modelled and reported expenditures for the Italian regions. Perfect closure against this benchmark was not possible due to around 10% uncertified expenditures. Reducing this gap would enable analysts to assess the quality of the modelling activity performed more effectively. This modelling exercise would need to be extended across EU regions to backstop its findings, although it will never be perfect due to the existence of multiple intervening factors (e.g., payment suspensions, issues with the management of individual OPs or with the processing of the data at any level, and strategic decisions for reporting expenditures). Nonetheless, the figures produced in this study at the individual NUTS2 level for IT proved to be quite reliable.
Uncertainty and sensitivity analyses enabled this study's assessment of the range of uncertainty in the temporal discrepancies between the reported and modelled expenditures, as well as its apportionment onto the input parameters and assumptions, respectively. The most impactful assumptions turned out to be the following: first, how the non-clearly regionalised expenditures are spread onto individual regions, Residual selector; and second, the number of backward years onto which the residual of the yearly payments is spread, Years. We re-ran our simulation by keeping these variables fixed at precise points in the admissible range. The results of the sensitivity analysis for the two parameters show that both increasing the number of years of backward shift and attributing the non-clearly regionalised expenditures (on the basis of yearly regional expenditures) may improve the match between the modelled and measured expenditures. However, the gain was marginal for some regions and the fit was worse for 2 of the 20 regions.
These findings are encouraging given the simplicity of the model developed. Further sophistication may be adopted to better reproduce the MS expenditures reported, as well as to corroborate this factor-prioritisation strategy.
The following are the principal take-home messages for different stakeholders:
• The EC may expand this research to other funding schemes and programming periods, including the current recovery funds for member states.
• Further MSs can make their figures available to strengthen the findings presented here. Standardised tools and forms for registering all payments would be helpful, along with data access for the researcher, ideally through central storage at the DG REGIO.
• The model's validation against MS figures provides a new database that econometricians can use to generate less time-lagged and more reliable estimates of the benefits of European funds on the ground.